Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data
Authors
Abstract
Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE’s Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a category is very small at lower levels of the hierarchy. Blindly applying a standard categorization model is likely to perform poorly if this fact is not taken into consideration. This paper introduces a novel technique for text categorization, called Boolean rule extraction, which enables you to effectively address this situation. In addition, models generated by this rule-based technique can be easily interpreted and modified by a human expert, enabling better human-machine interaction. The Text Rule Builder node and the newly developed HPBOOLRULE procedure in SAS® Text Miner implement this technique. The paper demonstrates how to use the HPBOOLRULE procedure to obtain effective predictive models at various hierarchy levels in a taxonomy.

INTRODUCTION

Hierarchies are becoming ever more popular for indexing and organizing large text corpora that contain millions or even billions of documents. Well-designed hierarchies are important for enterprise-level content categorization. The prevalence of complicated large-scale hierarchical taxonomies has generated a pressing need for automated hierarchical text categorization. Most state-of-the-art hierarchical text categorization techniques fall into two classes of models: the top-down model and the flat model.
A top-down model builds a classifier for each node of the hierarchy, and a document is classified as a member of a leaf category in the hierarchy if and only if it is also classified as a member of all the ancestor categories of the leaf category. [1] On the other hand, a flat model constructs a binary categorization problem [2] for each leaf category of the hierarchy, and a document is classified as a member of the leaf category if the prediction of the classifier that is built for the corresponding binary categorization problem is positive.

A flat model enjoys certain advantages over a hierarchical (top-down) model. First, implementation of a flat model is simple, and most existing classifiers can be used directly as base building blocks. Second, despite its simplicity, a flat model might generate more accurate results in practice because it does not suffer the error propagation problem of a hierarchical model. When a document is classified wrongly on an upper level of a hierarchical model, the error is propagated to its descendant categories.

However, a flat model faces training difficulties. First, because it does not use a divide-and-conquer approach to solve a problem as a hierarchical model does, each of its base classifiers is usually exposed to the full text corpus. Being exposed to the full text corpus requires the base classifier to be highly efficient, especially when the size of the text corpus is large. Second, the base classifiers of the flat model usually face the issue of data sparsity: many taxonomies consist of a very large number of leaf categories, and the number of documents in a leaf category can be quite small even in a huge corpus. Because the binary categorization problem is usually constructed by considering all documents that belong to the category as positive and all documents that do not belong to the category as negative, there is a very high imbalance between positive and negative samples for the constructed problem.
This issue requires the base classifier

[1] A leaf category is a node that has no children.
[2] Hierarchy information can still be incorporated by constructing the binary categorization problem in certain ways.
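
The two decision rules described in the introduction, and the class imbalance of the flat model's one-vs-rest construction, can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the toy taxonomy, the keyword-based stand-in classifiers, and all function names. It does not reproduce the HPBOOLRULE procedure or the paper's method.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy taxonomy: each category maps to its parent (None = top level).
PARENT: Dict[str, Optional[str]] = {
    "health": None,
    "cardiology": "health",
    "arrhythmia": "cardiology",
}

Classifier = Callable[[str], bool]


def path_to(category: str) -> List[str]:
    """Return the category preceded by all of its ancestors, top level first."""
    path = []
    node: Optional[str] = category
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def top_down_member(doc: str, leaf: str, clfs: Dict[str, Classifier]) -> bool:
    """Top-down rule: accept only if every ancestor and the leaf accept."""
    return all(clfs[c](doc) for c in path_to(leaf))


def flat_member(doc: str, leaf: str, clfs: Dict[str, Classifier]) -> bool:
    """Flat rule: only the leaf's own binary classifier decides."""
    return clfs[leaf](doc)


def one_vs_rest_labels(doc_leaves: List[str], leaf: str) -> List[int]:
    """Flat-model targets: 1 for documents in the leaf, 0 for all others."""
    return [1 if assigned == leaf else 0 for assigned in doc_leaves]


# Keyword stand-ins for trained classifiers (assumption, illustration only).
clfs: Dict[str, Classifier] = {
    "health": lambda d: "patient" in d,
    "cardiology": lambda d: "heart" in d,
    "arrhythmia": lambda d: "arrhythmia" in d,
}

doc = "arrhythmia detected in heart rhythm study"
# The top-level classifier misses this document, so the top-down model
# rejects the leaf (error propagation); the flat model still accepts it.
print(top_down_member(doc, "arrhythmia", clfs), flat_member(doc, "arrhythmia", clfs))

labels = one_vs_rest_labels(["arrhythmia", "other", "other", "other"], "arrhythmia")
print(sum(labels) / len(labels))  # fraction of positive training examples
```

In a real taxonomy with thousands of leaf categories, the fraction of positives computed in the last step would be far smaller than in this toy example, which is the imbalance the text describes.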
Similar resources
A Systematic study of Text Mining Techniques
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections such as information extraction, term extraction, text categorization, and storage of in...
DEFINDER: Rule-based Methods for the Extraction of Medical Terminology and their Associated Definitions from On-line Text
INTRODUCTION The problem addressed in this paper concerns the automatic identification and extraction of medical terms along with their definitions and modifiers from full text consumer-oriented medical articles. The system, DEFINDER (Definition Finder), uses rule-based techniques. The output of our system can be used in several applications: creation and/or enhancement of on-line terminologica...
Design and Test of the Real-time Text mining dashboard for Twitter
One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has become one of the concerns of data science scholar...
Text Mining
“Bag of words” model, acronym extraction, authorship ascription, coordinate matching, data mining, document clustering, document frequency, document retrieval, document similarity metrics, entity extraction, hidden Markov models, hubs and authorities, information extraction, information retrieval, key-phrase assignment, key-phrase extraction, knowledge engineering, language identification, link...
Applications of Natural Language Processing in Biodiversity Science
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the...
Publication date: 2015